Skip to content

feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110

Open
coketaste wants to merge 17 commits intodevelopfrom
coketaste/v2-rocm-path
Open

feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
coketaste wants to merge 17 commits intodevelopfrom
coketaste/v2-rocm-path

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

@coketaste coketaste commented Apr 22, 2026

Summary

Adds automatic ROCm install root resolution for the run phase, with independent host and container path handling.

What changed

  • New helpers: madengine.utils.rocm_path_resolver and madengine.utils.therock_markers (shared TheRock share/therock file markers).
  • Host precedence: top-level MAD_ROCM_PATH in additional context → auto-detect (traditional /opt/rocm, versioned /opt/rocm-*, TheRock rocm-sdk markers, rocminfo/amd-smi/rocm-smi on PATH) → ROCM_PATH env var → /opt/rocm. Set MAD_AUTO_ROCM_PATH=0 to skip scanning.
  • Container: docker_env_vars.MAD_ROCM_PATH sets in-container ROCM_PATH (key consumed, not forwarded as-is); else resolved from OCI image config (docker image inspect) → in-image shell probe (docker run --rm) → default /opt/rocm with warning. The host-resolved path is not mirrored into the container by default.
  • Removed --rocm-path CLI flag: was an alias for MAD_ROCM_PATH but implied it covered both host and container paths. Use --additional-context instead:
    • Host: {"MAD_ROCM_PATH": "/path/to/host/rocm"}
    • Container: {"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}
  • Wired through Context, run orchestrator, and CLI/validators; README, configuration, usage, and CHANGELOG updated.
  • therock_detector script: prepends the package parent to sys.path so imports work when run as a file.
  • Unit tests (test_rocm_path, test_therock_markers) and integration GPU test adjustments.
  • Run phase environment table: prints host vs. container installation type, ROCm/CUDA root, and version side-by-side.

How to test

  • pytest tests/unit/test_rocm_path.py tests/unit/test_therock_markers.py (or full tests/unit/).
  • madengine run --help — confirm --rocm-path no longer appears; MAD_ROCM_PATH is set via --additional-context.
  • On a TheRock or versioned-ROCm image, run a minimal madengine run without MAD_ROCM_PATH and confirm a sensible ROCM_PATH in the generated context.

Checklist

  • --rocm-path CLI flag removed; interface is MAD_ROCM_PATH in --additional-context.
  • MAD_ROCM_PATH: null/blank in docker_env_vars treated as "unset" — host path not mirrored.
  • No behavior change for MAD_AUTO_ROCM_PATH=0 beyond documented legacy path.

Add rocm_path_resolver and shared therock_markers; wire Context, run CLI,
validators, and orchestrator. Document in README, configuration, and usage;
update CHANGELOG. Extend unit and integration tests. Adjust therock_detector
(sys.path helper) without ADR docs in this commit.

Made-with: Cursor
@coketaste coketaste self-assigned this Apr 22, 2026
Copilot AI review requested due to automatic review settings April 22, 2026 23:36
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds robust ROCm root resolution for madengine run, including auto-detection for versioned /opt/rocm-* and TheRock layouts, plus explicit host/container override semantics via MAD_ROCM_PATH.

Changes:

  • Introduces ROCm root detection/normalization helpers (rocm_path_resolver) and shared TheRock file-marker helpers (therock_markers).
  • Wires host/container ROCm-path precedence through Context, run_orchestrator, CLI validation, and help text.
  • Updates docs/tests (unit + integration) to reflect the new resolution behavior and TheRock markers.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker path helpers and detection.
tests/unit/test_rocm_path.py Adds/extends unit tests for host/container ROCm path precedence and auto-detect behavior.
tests/integration/test_gpu_management.py Adjusts integration tests to patch the correct vendor-detection import path and stabilizes a Context-based test with mocks.
src/madengine/utils/therock_markers.py Adds canonical TheRock marker relative paths and helper functions.
src/madengine/utils/rocm_path_resolver.py Implements host/container ROCm root resolution, auto-detect heuristics, and legacy compatibility functions.
src/madengine/scripts/common/tools/therock_detector.py Reuses shared TheRock marker helpers and improves importability when run as a standalone script.
src/madengine/orchestration/run_orchestrator.py Defers ROCm path resolution to Context instead of pre-resolving in the orchestrator.
src/madengine/core/context.py Uses new resolver functions to compute host ROCm path and set container ROCM_PATH consistently.
src/madengine/core/constants.py Keeps get_rocm_path() as a legacy (non-auto-detect) wrapper delegating to the new resolver’s legacy function.
src/madengine/cli/validators.py Adds schema validation for host/container MAD_ROCM_PATH values in additional context.
src/madengine/cli/commands/run.py Updates --rocm-path help text to reflect new precedence/auto-detect behavior.
docs/usage.md Documents host vs container ROCm root overrides and auto-detect disabling.
docs/configuration.md Documents new precedence rules and override mechanisms for ROCm root resolution.
README.md Updates top-level docs to reflect new ROCm-path behavior and override knobs.
CHANGELOG.md Notes the new ROCm auto-detect + override behavior and references a design doc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/cli/validators.py
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py
Comment thread docs/configuration.md Outdated
Comment thread CHANGELOG.md Outdated
coketaste and others added 2 commits April 22, 2026 19:56
… checks

- csv_parser.py: resolve rocm-smi via shutil.which() so TheRock images
  (where tools live in a Python venv, not /opt/rocm/bin/) are detected
  correctly; accept path_resolver to get ROCm version without hardcoding
  /opt/rocm/.info/version; add bounds check in NVIDIA GPU info parser
- rocenv_tool.py: pass path_resolver to CSVParser so version resolution
  uses RocmPathResolver.get_version() for both TheRock and traditional installs
- container_runner.py: replace host-resolved hardcoded amd-smi/rocm-smi
  paths in container exec with PATH-based resolution; add run phase
  environment table showing host vs container installation type, ROCm/CUDA
  root, and version side-by-side (supports apt install and therock layouts)
- docs: document run phase environment table in usage.md and configuration.md

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 02:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a centralized ROCm install-root resolution flow for the run phase (host + container), including auto-detection of versioned /opt/rocm-* and TheRock marker layouts, and wires it through context creation, container execution, CLI validation, and documentation.

Changes:

  • Introduces madengine.utils.rocm_path_resolver + madengine.utils.therock_markers and updates Context to resolve host ROCm root and propagate container ROCM_PATH with clear override precedence.
  • Updates runtime execution (container runner + rocEnvTool CSV output) to better handle TheRock-style layouts where tools are discovered via PATH.
  • Adds unit tests for TheRock markers and ROCm path resolution, adjusts integration tests, and updates CLI help + docs/README/CHANGELOG to document the new behavior.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit coverage for TheRock marker-path helpers.
tests/unit/test_rocm_path.py Adds unit coverage for new host/container ROCm path resolution behavior and TheRock layouts.
tests/integration/test_gpu_management.py Updates integration mocking/patch points to match current vendor detection paths and stabilizes Context initialization in tests.
src/madengine/utils/therock_markers.py Defines shared TheRock on-disk marker locations and detection helper.
src/madengine/utils/rocm_path_resolver.py Implements host/container ROCm root resolution, auto-detect, and legacy compatibility functions.
src/madengine/scripts/common/tools/therock_detector.py Reuses shared marker helpers; improves importability when run as a standalone script.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Passes path resolver into CSVParser so version resolution follows detected ROCm root.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Uses PATH discovery for rocm-smi and uses resolver-based version when available.
src/madengine/orchestration/run_orchestrator.py Stops pre-resolving ROCm path in orchestrator; delegates to Context/resolver logic.
src/madengine/execution/container_runner.py Uses in-container PATH for SMI tools and prints a host vs container environment summary table.
src/madengine/core/context.py Centralizes ROCm path resolution and container ROCM_PATH propagation during runtime context init.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy behavior and delegates to get_rocm_path_legacy().
src/madengine/cli/validators.py Validates new MAD_ROCM_PATH fields in additional context (top-level + docker_env_vars).
src/madengine/cli/commands/run.py Updates --rocm-path help to match new precedence/auto-detect behavior.
docs/usage.md Documents host/container override behavior and the new run-phase environment table output.
docs/configuration.md Documents precedence, overrides, and references ADR for ROCm path resolution.
README.md Updates high-level ROCm path messaging to match new auto-detect + MAD_ROCM_PATH override model.
CHANGELOG.md Records new ROCm path resolution behavior and references design documentation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread docs/configuration.md Outdated
Comment thread CHANGELOG.md Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
coketaste and others added 3 commits April 22, 2026 21:48
…able

The previous shell command embedded \$(rocm-sdk path --root) inside
[ -f "..." ], which broke quoting when passed via docker exec bash -c "..."
causing the test to always fall through to 'unknown'.

Replace with a quoting-safe check: if rocm-sdk command exists and returns
a root path, report 'therock'; otherwise check /opt/rocm/.info/version
for traditional apt install layout.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Apply docker_env_vars overrides in context; finalize in run_container (AMD)
  from OCI image env, then docker run probe, then /opt/rocm with warning
- Update docs, README, CLI reference, CHANGELOG, and unit tests
- ADR 0001 left out of this commit (untracked)

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 24, 2026 02:11
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds robust ROCm root resolution for the run phase, including host-side auto-detection (TheRock + versioned /opt/rocm-*) and explicit host/container override behavior, and updates runtime logging + docs/tests accordingly.

Changes:

  • Introduces madengine.utils.rocm_path_resolver + madengine.utils.therock_markers to centralize ROCm/TheRock detection and documented precedence.
  • Updates runtime wiring (Context + ContainerRunner) so host ROCm root is resolved early, while container ROCM_PATH is finalized at Docker run time (OCI env → probe → fallback).
  • Adds/updates unit + integration tests and updates documentation/README/CHANGELOG to reflect the new precedence and behaviors.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/madengine/utils/rocm_path_resolver.py New host/container ROCm path resolution logic (auto-detect, overrides, OCI/probe finalization).
src/madengine/utils/therock_markers.py New shared TheRock marker path helpers.
src/madengine/core/context.py Uses new host resolver; stops mirroring host ROCm root into container env at init; applies container override mapping only.
src/madengine/execution/container_runner.py Finalizes container ROCM_PATH at run time; prints host/container environment summary; uses PATH-based SMI tool invocation in container.
src/madengine/orchestration/run_orchestrator.py Passes --rocm-path through to Context without legacy resolution.
src/madengine/core/constants.py get_rocm_path() becomes legacy wrapper delegating to resolver’s legacy behavior.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH in top-level context and docker_env_vars.
src/madengine/cli/commands/run.py Updates --rocm-path help text to describe new host behavior + auto-detect toggle.
src/madengine/scripts/common/tools/therock_detector.py Imports TheRock marker helpers reliably when run as a file.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Threads path_resolver into CSV parsing.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Uses PATH-based rocm-smi detection; uses resolver for ROCm version; adds parsing guard.
tests/unit/test_rocm_path.py Adds coverage for new resolver APIs and revised Context/container behavior.
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker helpers.
tests/integration/test_gpu_management.py Fixes patch target for vendor auto-detection; stabilizes non-AMD context creation in integration tests.
docs/usage.md Documents host vs container ROCm path behavior + new run-phase environment table.
docs/configuration.md Documents new precedence and container finalization behavior (OCI/probe).
docs/installation.md Clarifies host vs container ROCm path configuration.
docs/cli-reference.md Updates CLI reference for --rocm-path and env vars.
README.md Updates high-level ROCm path behavior summary.
CHANGELOG.md Notes behavior change and new resolver modules.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py
Comment thread docs/cli-reference.md Outdated
coketaste and others added 2 commits April 24, 2026 03:08
MI350X (gfx950) was missing from the skip_gpu_arch fixture, causing
test_commandline_argument_skip_gpu_arch to fail on this machine.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…TH in --additional-context

The --rocm-path flag was an alias for top-level MAD_ROCM_PATH but its
help text implied it could set both host and container paths, risking
confusion when environments differ. Users should set host and container
ROCm roots independently via --additional-context:

  Host:      {"MAD_ROCM_PATH": "/path/to/host/rocm"}
  Container: {"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}

Also fixes datetime.utcnow() deprecation warnings in mongodb.py
(replaced with datetime.now(timezone.utc)).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 19:44
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new ROCm root resolution flow that can auto-detect host ROCm installs (including versioned /opt/rocm-* and TheRock marker layouts) and cleanly separates host ROCm resolution from in-container ROCM_PATH selection at Docker run time.

Changes:

  • Added madengine.utils.rocm_path_resolver (host auto-detect + container finalization) and madengine.utils.therock_markers (shared TheRock marker paths).
  • Updated runtime wiring so Context resolves host ROCm root via the resolver, while ContainerRunner finalizes container ROCM_PATH at run time (OCI env → probe → default).
  • Removed madengine run --rocm-path usage in favor of MAD_ROCM_PATH (top-level and docker_env_vars), and updated docs/tests accordingly.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/madengine/utils/rocm_path_resolver.py Implements host ROCm auto-detect and container ROCM_PATH override/finalization logic.
src/madengine/utils/therock_markers.py Defines canonical TheRock marker relative paths + helper predicates.
src/madengine/core/context.py Switches host ROCm path resolution to the new resolver and defers container ROCM_PATH finalization.
src/madengine/execution/container_runner.py Finalizes in-container ROCM_PATH during Docker runs and adds a host/container environment summary table.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy resolution and delegates to resolver legacy helper.
src/madengine/cli/commands/run.py Removes --rocm-path CLI option wiring.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH (host and docker_env_vars) shapes in additional context.
src/madengine/scripts/common/tools/therock_detector.py Makes the tool importable when executed as a file; reuses shared marker helpers.
src/madengine/scripts/common/pre_scripts/rocEnvTool/{rocenv_tool.py,csv_parser.py} Passes resolver into CSV parsing and improves tool/path selection for TheRock-style layouts.
tests/unit/test_rocm_path.py Adds/updates unit coverage for resolver behaviors and container override rules.
tests/unit/test_therock_markers.py Adds unit coverage for TheRock marker helpers.
tests/integration/test_gpu_management.py Updates patch points and stabilizes integration behavior with Context initialization changes.
docs/{usage.md,installation.md,configuration.md,cli-reference.md} Updates documented precedence and configuration guidance for host vs container ROCm roots.
README.md Updates top-level ROCm path guidance and examples.
CHANGELOG.md Adds an Unreleased entry describing new ROCm resolution behavior.
src/madengine/database/mongodb.py Switches metadata timestamps to timezone-aware UTC.
tests/fixtures/dummy/models.json Adds gfx950 to skip list for a dummy model.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread README.md Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/utils/rocm_path_resolver.py
Comment thread src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Outdated
Comment thread CHANGELOG.md Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
coketaste and others added 2 commits April 24, 2026 16:22
…alongside is_therock_tree)

Conflict in container_runner.py: HEAD added is_therock_tree import for TheRock
environment table; develop added PERFORMANCE_LOG_PATTERN import from deployment.base.
Both imports are used in the file — keep both.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- validators.py: restrict docker_env_vars.MAD_ROCM_PATH to str|None only
  (was allowing int/float, inconsistent with error message)

- rocm_path_resolver.py: always pop MAD_ROCM_PATH from docker_env even
  when None, preventing the key from leaking into docker run env

- rocm_path_resolver.py: add bin/rocm-smi to _CONTAINER_PROBE_SH
  try_root() predicate to match Python-side looks_like_rocm_root()

- container_runner.py: clear ROCM_PATH before finalize_container_rocm_path
  each run so multi-model runs re-resolve per docker_image instead of
  reusing image 1's path for all subsequent images

- csv_parser.py: shlex.quote() the rocm-smi path before shell interpolation;
  only increment num_gpu after successful NVIDIA line parse (git-ignored
  path, committed separately if needed)

- README.md: replace curly/smart quotes with ASCII quotes in bash examples

- CHANGELOG.md: remove stale --rocm-path alias reference (flag was removed);
  remove broken docs/adr/ link

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 21:35
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new ROCm root resolution system to better support TheRock/versioned ROCm layouts and clearly separate host vs container ROCm path overrides, wiring it through context initialization and Docker run-time environment setup.

Changes:

  • Introduces rocm_path_resolver (host auto-detect + container finalization) and therock_markers (shared marker paths).
  • Updates runtime flow to resolve host ROCm root in Context and set container ROCM_PATH at Docker run-time (OCI env → probe → default), plus prints a host/container environment summary table.
  • Removes --rocm-path and updates validators/tests/docs to use MAD_ROCM_PATH (host) and docker_env_vars.MAD_ROCM_PATH (container).

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/madengine/utils/rocm_path_resolver.py Implements host ROCm auto-detect and container ROCM_PATH finalization logic.
src/madengine/utils/therock_markers.py Adds shared TheRock marker path helpers and detection.
src/madengine/core/context.py Uses new host ROCm resolver; stops mirroring host ROCM_PATH into container env at init.
src/madengine/execution/container_runner.py Finalizes container ROCM_PATH at run-time and prints host/container env summary.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy (no auto-detect) and delegates to resolver legacy helper.
src/madengine/cli/commands/run.py Removes --rocm-path option and associated wiring.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH types in additional context (host + docker_env_vars).
src/madengine/orchestration/run_orchestrator.py Stops passing legacy rocm_path into Context.
src/madengine/scripts/common/tools/therock_detector.py Reuses shared marker helpers; makes imports work when run as a file.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Passes resolver into CSV generation for version lookup.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Improves tool discovery (PATH-based) and ROCm version collection fallback.
src/madengine/database/mongodb.py Stores upload timestamps using timezone-aware UTC datetimes.
tests/unit/test_rocm_path.py Adds/updates unit tests for resolver behavior and Context integration.
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker helpers.
tests/integration/test_gpu_management.py Updates mocks/patch targets for vendor auto-detection and Context init behavior.
tests/fixtures/dummy/models.json Updates dummy fixture skip list to include gfx950.
docs/configuration.md Documents new ROCm path precedence and overrides.
docs/usage.md Documents host/container ROCm path behavior and new env summary table.
docs/installation.md Updates install guidance to MAD_ROCM_PATH / container override behavior.
docs/cli-reference.md Removes --rocm-path references; documents new env/override keys.
README.md Updates top-level ROCm path feature description and examples.
CHANGELOG.md Notes new default host auto-detect + container ROCM_PATH resolution behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/utils/rocm_path_resolver.py
Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread docs/configuration.md Outdated
Comment thread docs/usage.md Outdated
Comment thread src/madengine/utils/therock_markers.py Outdated
coketaste and others added 3 commits April 24, 2026 20:50
…n logic

- Remove broken docs/adr/ links from configuration.md, usage.md,
  rocm_path_resolver.py docstring, and therock_markers.py comment
- Remove resolve_container_rocm_path() (thin wrapper with no unique logic)
- Remove early apply_container_rocm_path_overrides() call in init_gpu_context:
  it set ROCM_PATH only for run_container to immediately pop it; container
  ROCM_PATH finalization now happens exclusively in finalize_container_rocm_path
- Fix auto_detect() inner loop: first iteration was a duplicate
  _looks_like_rocm_root(root) check already done before the loop; walk
  to root.parent first, then check
- Silence grep -oP stderr in container ROCm version command (BusyBox images)
- Guard whitespace-only ROCM_PATH in apply_container_rocm_path_overrides
- Update tests to match: MAD_ROCM_PATH preserved in docker_env_vars at
  context init, consumed at run time by finalize_container_rocm_path

All 383 unit tests pass.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
)

context.py had a conflict in Context.__init__: develop added
detect_local_gpu_arch (for GPU arch auto-detection in full-run mode)
and rocm_path params; the PR branch dropped both in favour of
resolve_host_rocm_path(). Resolution: keep detect_local_gpu_arch
(still referenced by init_build_context) and drop rocm_path (replaced
by the PR's automatic host ROCm path resolution via resolve_host_rocm_path).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 21:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds robust ROCm root resolution to better support non-/opt/rocm layouts (notably TheRock / versioned installs) and cleanly separates host ROCm detection from in-container ROCM_PATH selection at Docker run time.

Changes:

  • Introduces madengine.utils.rocm_path_resolver (host auto-detect + container finalization) and shared TheRock marker helpers in madengine.utils.therock_markers.
  • Wires host ROCm resolution through Context and run execution, and finalizes container ROCM_PATH during Docker runs (OCI env → in-container probe → default).
  • Updates tests and documentation to reflect the new override keys (MAD_ROCM_PATH, docker_env_vars.MAD_ROCM_PATH) and the new run-time environment reporting.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit coverage for TheRock marker-path helpers and detection.
tests/unit/test_rocm_path.py Expands unit coverage for host/container ROCm resolution behavior and precedence.
tests/integration/test_gpu_management.py Updates integration tests to patch the correct vendor auto-detect import and stabilizes Context init in non-AMD scenarios.
tests/fixtures/dummy/models.json Updates dummy model GPU-arch skip list (adds gfx950).
src/madengine/utils/therock_markers.py Defines canonical TheRock marker paths and a minimal is_therock_tree() check.
src/madengine/utils/rocm_path_resolver.py Implements host ROCm auto-detect + container ROCM_PATH finalization (OCI/env/probe/default) and legacy wrappers.
src/madengine/scripts/common/tools/therock_detector.py Makes TheRock detection script importable when run as a file by amending sys.path; uses shared marker helpers with fallback.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Passes a resolver into CSV parsing so ROCm version reporting works across layouts.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Uses PATH-based tool discovery and resolver-based ROCm version retrieval; adds minor robustness to parsing.
src/madengine/orchestration/run_orchestrator.py Removes explicit --rocm-path plumbing and relies on Context host resolution.
src/madengine/execution/container_runner.py Finalizes in-container ROCM_PATH at run time for AMD and prints a host vs container environment summary.
src/madengine/database/mongodb.py Switches metadata timestamps to timezone-aware UTC values.
src/madengine/core/context.py Replaces legacy ROCm path selection with resolve_host_rocm_path() and stops mirroring host path into container env at init time.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy-only and delegates to resolver’s legacy helper.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH types in additional context (top-level + docker_env_vars).
src/madengine/cli/commands/run.py Removes the --rocm-path CLI option and associated plumbing.
docs/usage.md Documents host vs container ROCm path behavior and the new run-phase environment table.
docs/installation.md Updates installation guidance for host ROCm overrides and clarifies container behavior.
docs/configuration.md Documents new ROCm path precedence and container resolution policy (OCI/env/probe/default).
docs/cli-reference.md Removes --rocm-path from CLI reference and documents new override keys and env vars.
README.md Updates README to reflect host auto-detect and container ROCM_PATH behavior.
CHANGELOG.md Notes the ROCm path resolution behavior change and the new override policy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/utils/rocm_path_resolver.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/cli/commands/run.py
coketaste and others added 3 commits April 27, 2026 16:23
apply_container_rocm_path_overrides() was falling back to host_path
when docker_env_vars.MAD_ROCM_PATH was None or blank, silently mirroring
the host ROCm root into the container — contradicting the documented
"no host mirroring by default" policy.

Fix: after popping the key (always consumed), return None immediately for
None/blank values instead of substituting host_path.

Add test_apply_overrides_null_mad_rocm_path_does_not_mirror_host to
cover None, "", and whitespace-only values.

Resolves Copilot review comment on PR #110.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
The directory was untracked via git rm --cached in c6370bb but never
added to .gitignore, leaving it one accidental "git add ." away from
being re-committed.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 21:38
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds robust ROCm install-root resolution with separate host vs in-container handling, including TheRock marker support, updated CLI/docs, and test coverage updates across unit + integration tests.

Changes:

  • Introduces host ROCm root auto-detection + overrides via top-level MAD_ROCM_PATH, plus container-time ROCM_PATH finalization (OCI env → probe → default).
  • Removes --rocm-path from madengine run, routing configuration through --additional-context (host + docker_env_vars).
  • Updates runtime logging/diagnostics (run env table), tooling scripts, validators, and test suites to match the new resolution behavior.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit tests validating TheRock marker-path helpers and detection behavior.
tests/unit/test_rocm_path.py Expands unit coverage for host/container ROCm path resolution, MAD_ROCM_PATH precedence, and container finalization behavior.
tests/integration/test_gpu_management.py Fixes patch target for vendor auto-detect and stabilizes a non-AMD renderD test setup.
tests/fixtures/dummy/models.json Updates dummy fixture to skip gfx950 in skip_gpu_arch.
src/madengine/utils/therock_markers.py Adds shared TheRock marker constants + helper functions for consistent detection.
src/madengine/utils/rocm_path_resolver.py Implements host ROCm auto-detection, container ROCM_PATH finalization, and helper APIs.
src/madengine/scripts/common/tools/therock_detector.py Improves script importability when executed as a file; reuses shared TheRock marker helpers.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Passes a path resolver into CSV parsing to support TheRock layouts/version lookup.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Uses PATH-based rocm-smi discovery and safer parsing for GPU info + ROCm version lookup.
src/madengine/orchestration/run_orchestrator.py Stops threading a legacy rocm_path arg into Context; relies on Context resolution.
src/madengine/execution/container_runner.py Finalizes container ROCM_PATH at run time and adds a host-vs-container environment summary table.
src/madengine/database/mongodb.py Switches timestamps to timezone-aware UTC to avoid utcnow() deprecation warnings.
src/madengine/core/context.py Resolves host ROCm root via resolve_host_rocm_path; stops mirroring host ROCM_PATH into container env at init.
src/madengine/core/constants.py Re-scopes get_rocm_path() as legacy resolution (no auto-detect) delegating to the new resolver helper.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH types in additional context (host + docker_env_vars).
src/madengine/cli/commands/run.py Removes --rocm-path flag from the run command.
docs/usage.md Documents host vs container ROCm path behavior and the run-phase environment table output.
docs/installation.md Updates guidance to use MAD_ROCM_PATH (host) and docker_env_vars.MAD_ROCM_PATH (container).
docs/configuration.md Documents new resolution precedence and container-time ROCM_PATH finalization behavior.
docs/cli-reference.md Removes --rocm-path from reference and updates examples for MAD_ROCM_PATH usage.
README.md Updates top-level ROCm path messaging and examples to match new behavior.
CHANGELOG.md Records new auto-detect behavior, container resolution, CLI flag removal, and related fixes.
.gitignore Ignores run_directory/ artifacts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/utils/rocm_path_resolver.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
@coketaste coketaste requested a review from Copilot April 27, 2026 21:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new ROCm root resolution pipeline to clearly separate host ROCm root detection from in-container ROCM_PATH resolution, and removes the ambiguous --rocm-path CLI flag in favor of MAD_ROCM_PATH via --additional-context.

Changes:

  • Introduces rocm_path_resolver + therock_markers utilities and threads host ROCm root resolution through Context.
  • Finalizes container ROCM_PATH during Docker run (OCI env → in-container probe → default), and updates runtime logging to show host vs container environment details.
  • Updates CLI/docs/changelog, adjusts TheRock detection tooling, and extends unit/integration tests accordingly.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker path helpers.
tests/unit/test_rocm_path.py Updates/extends unit tests for MAD_ROCM_PATH + auto-detect + container finalization behavior.
tests/integration/test_gpu_management.py Fixes vendor auto-detect patch target; stabilizes Context integration assertions via mocking.
tests/fixtures/dummy/models.json Adds gfx950 to the skip list for dummy fixture coverage.
src/madengine/utils/therock_markers.py Adds shared TheRock marker path helpers (share/therock/...).
src/madengine/utils/rocm_path_resolver.py Implements host auto-detect + container finalization logic (OCI env + probe) and override handling.
src/madengine/scripts/common/tools/therock_detector.py Makes marker imports work when executed as a standalone file by adjusting sys.path; reuses marker helpers.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Passes path_resolver into CSVParser so ROCm version works for TheRock and traditional layouts.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Resolves rocm-smi via PATH first; uses resolver for version; adds bounds-checking in NVIDIA parsing.
src/madengine/orchestration/run_orchestrator.py Removes --rocm-path wiring; relies on Context’s host ROCm resolution.
src/madengine/execution/container_runner.py Finalizes in-container ROCM_PATH at runtime; switches AMD SMI checks to PATH-based; prints host vs container env table.
src/madengine/database/mongodb.py Replaces datetime.utcnow() with datetime.now(timezone.utc).
src/madengine/core/context.py Resolves host ROCm root via resolve_host_rocm_path; stops setting container ROCM_PATH during context init.
src/madengine/core/constants.py Keeps get_rocm_path() as legacy (no auto-scan), delegating to resolver’s legacy helper.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH types in additional context (top-level and docker_env_vars).
src/madengine/cli/commands/run.py Removes the --rocm-path option and associated plumbing.
docs/usage.md Documents host vs container ROCm behavior and the new run-phase environment table.
docs/installation.md Updates non-default ROCm guidance to use MAD_ROCM_PATH semantics and clarifies container behavior.
docs/configuration.md Documents host auto-detect, override keys, and container precedence.
docs/cli-reference.md Removes --rocm-path references; adds MAD_ROCM_PATH examples.
README.md Updates top-level ROCm path guidance to match new host/container split.
CHANGELOG.md Documents new ROCm path resolution behavior and removal of --rocm-path.
.gitignore Ignores run_directory/ output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1062 to +1068
# Clear any ROCM_PATH left from a previous model run so finalize always
# re-resolves per docker_image (OCI config → probe → default).
# ROCM_PATH is program-managed; users override via MAD_ROCM_PATH in
# additional_context.docker_env_vars, which was merged just above and
# will be consumed by finalize_container_rocm_path correctly.
self.context.ctx["docker_env_vars"].pop("ROCM_PATH", None)

Comment on lines +83 to +88
host_rocm_root = getattr(context, "_rocm_path", None) or "unknown"
host_install_type = (
"therock"
if is_therock_tree(Path(host_rocm_root))
else "apt install"
)
Comment on lines +433 to +434
if existing not in (None, ""):
croot = normalize_rocm_path(str(existing).strip())
Comment thread docs/configuration.md
Comment on lines +139 to +145
- **Additional context (container):** `"docker_env_vars": { "MAD_ROCM_PATH": "/path/inside/image" }` — sets the in-container `ROCM_PATH` for Docker runs. If omitted, at `run` time madengine uses the image OCI `Env` (`ROCM_PATH` / `ROCM_HOME`) if present, then an in-container probe, then defaults to `/opt/rocm`. The host-resolved path is **not** mirrored into the container.

These two keys are independent, allowing host and container to use different ROCm installations without confusion.

Precedence (host): top-level `MAD_ROCM_PATH` → auto-detect (unless disabled) → `ROCM_PATH` → `/opt/rocm`.

Precedence (container, **local Docker `run`**, **AMD**): `docker_env_vars.MAD_ROCM_PATH` (maps to `ROCM_PATH` for the workload) or explicit `ROCM_PATH` in `docker_env_vars` → image OCI `Env` (`ROCM_PATH` / `ROCM_HOME`) → in-image probe → default `/opt/rocm` with a warning. Implemented in `ContainerRunner.run_container` after the run image is resolved.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants